Overview

Visualizing data in higher-dimensional space (Euclidean space, \(\mathbb{R}^p,~~p>3\), where p is the number of numeric variables) can be messy and unintuitive. An analysis of higher dimensions must remain interpretable in terms of the original dimensions and should minimize the loss of the information held in the data.

To these ends we advise the use of projection pursuit as achieved in the R package tourr (H. Wickham & D. Cook, 2011). Further, we implement a method for manual controls following D. Cook & A. Buja (1997) in an R package, spinifex, currently available with devtools::install_github("nspyrison/spinifex"). We also compare and contrast alternative methodologies, namely Principal Component Analysis (PCA; K. Pearson, 1901), t-distributed Stochastic Neighbor Embedding (t-SNE; L. van der Maaten & G. Hinton, 2008), and the holes-optimized tour (an application of projection pursuit; J. Friedman & J. Tukey, 1974). The grand tour was proposed by D. Asimov (1985).

The R package tourr (H. Wickham & D. Cook, 2011) provides a means to animate 2-d projections of a p-dimensional data object as it is rotated. The path of rotation may take the form of a random walk, a predefined path, or the optimization of an index by (“semi-”stochastic) gradient descent (projection pursuit, described above).
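As a minimal sketch of what a single frame of such an animation involves, the base R snippet below projects p-dimensional data onto a random orthonormal 2-d basis. The data are simulated stand-ins and `random_basis()` is a hypothetical helper written for this illustration, not a tourr function; tourr additionally interpolates smoothly between successive bases, which is omitted here.

```r
set.seed(1)                      # reproducible illustration only
n <- 74                          # number of observations (simulated, not the flea data)
p <- 6                           # number of numeric variables
X <- matrix(rnorm(n * p), nrow = n, ncol = p)

# Hypothetical helper: draw a random p x d matrix and orthonormalize
# its columns with a QR decomposition.
random_basis <- function(p, d = 2) {
  qr.Q(qr(matrix(rnorm(p * d), nrow = p, ncol = d)))
}

A <- random_basis(p)             # p x 2 projection basis
Y <- X %*% A                     # n x 2 projected data: one frame of a tour

# The basis columns are orthonormal, so crossprod(A) is the 2 x 2 identity
# up to floating point error.
round(crossprod(A), 10)
```

Each column of A holds the contribution of the original variables to one axis of the plot, which is what keeps a linear projection interpretable in terms of the original dimensions.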

Work in progress. TODO: add to, clean up.

Thanks

Prof. Dianne Cook - Guidance, inspiration, and contributions to projection pursuit

Dr. Ursula Laa - Collaboration, use cases, and development feedback

References

H. Wickham, D. Cook, H. Hofmann, and A. Buja (2011). tourr: An R package for exploring multivariate data with projections. Journal of Statistical Software 40(2), http://www.jstatsoft.org/v40.

D. Asimov (1985). The grand tour: a tool for viewing multidimensional data. SIAM Journal on Scientific and Statistical Computing, 6(1), 128–143.

D. Cook, & A. Buja (1997). Manual Controls for High-Dimensional Data Projections. Journal of Computational and Graphical Statistics, 6(4), 464–480. https://doi.org/10.2307/1390747

H. Wickham, D. Cook, and H. Hofmann (2015). Visualising statistical models: Removing the blindfold (with discussion). Statistical Analysis and Data Mining 8(4), 203–225.

Other reading

Principal Component Analysis | t-distributed Stochastic Neighbor Embedding
Projection pursuit | Grand Tour | Spinifex Hopping Mouse

Spinifex

Data - flea

74 observations x 6 variables of physical measurements taken across 3 species of flea beetle. The methods are unsupervised, but the data are colored by species.

TODO: scale output of spinifex::proj_data(), case handling for spinifex::slideshow(), apply Phys data.

TODO: FIX SPINIFEX HERE

Tourr

Methodology comparison

Data - flea

74 observations x 6 variables of physical measurements taken across 3 species of flea beetle. The methods are unsupervised, but the data are colored by species.

Methods

  • Principal component analysis (PCA): p ordered linear combinations of p dimensions
  • t-distributed Stochastic Neighbor Embedding (t-SNE): p unordered non-linear combinations of p dimensions
  • Tour (holes optimized): stochastic gradient optimization of the white space in the middle of a 2-dimensional projection

Method        Interpretable   MaxVarRetention   GlobalOptima   CannotOverfit   NonLinearData
PCA           TRUE            FALSE             TRUE           TRUE            FALSE
t-SNE         FALSE           NA                FALSE          FALSE           TRUE
Tour, holes   TRUE            TRUE              FALSE          TRUE            FALSE
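
To make the "white space in the middle" idea concrete, here is a small base R sketch of a holes-style index in the form commonly given in the projection pursuit literature. It assumes the projected data are already centered and sphered; `holes_index()` is a hypothetical name used only for this illustration and is not the tourr implementation.

```r
# Hypothetical helper: holes-style index of an n x d projection Y.
# Assumes Y is centered and sphered. Larger values indicate fewer
# points near the center of the projection.
holes_index <- function(Y) {
  Y <- as.matrix(Y)
  d <- ncol(Y)
  num <- 1 - mean(exp(-0.5 * rowSums(Y^2)))  # penalizes points near the origin
  den <- 1 - exp(-d / 2)                     # normalizing constant for d dimensions
  num / den
}

set.seed(2)
# A 2-d Gaussian blob: mass concentrated in the middle.
blob <- matrix(rnorm(200 * 2), ncol = 2)

# Points on a ring of radius 2: empty in the middle.
theta <- seq(0, 2 * pi, length.out = 200)
ring <- cbind(2 * cos(theta), 2 * sin(theta))

# The ring scores higher than the blob.
holes_index(ring) > holes_index(blob)
```

A guided tour searches over projection bases for ones that increase an index like this, so the optimum it reaches depends on the starting basis. That is consistent with the FALSE entry for a global optimum in the table.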

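# Keep the six numeric measurements; species is used only for color.
f <- tourr::flea[, 1:6]

# PCA: ggplot() needs a data frame, so plot the scores stored in f.pca$x.
f.pca <- stats::prcomp(f)
ggplot2::ggplot(as.data.frame(f.pca$x)) + ...

# t-SNE: the embedding is returned in f.tsne$Y.
f.tsne <- Rtsne::Rtsne(f, ...)
f.tsne.pca <- stats::prcomp(f.tsne$Y)
ggplot2::ggplot(as.data.frame(f.tsne.pca$x)) + ...

# Holes-optimized tour (recent tourr releases call the index as holes()).
# ggplot() needs a data frame, so convert the tour result before plotting.
f.holes_end <- tourr::animate_xy(f, tourr::guided_tour(tourr::holes()))
ggplot2::ggplot(f.holes_end) + ...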

Variation lost from dimension reduction
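
One way to quantify this, sketched under stated assumptions: for a linear method, the share of total variance retained by a 2-d projection can be computed directly, and the remainder is the variation lost. The example uses base R only, with `datasets::iris` as a stand-in for the flea data; `var_retained()` is a hypothetical helper, not part of tourr or spinifex.

```r
# Stand-in data: four numeric iris measurements, standardized.
X <- scale(as.matrix(datasets::iris[, 1:4]))

# PCA: the squared standard deviations are the component variances.
pca <- stats::prcomp(X)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
retained_2d <- sum(var_explained[1:2])   # share kept by the first two components
lost_2d <- 1 - retained_2d               # share lost by reducing to 2-d

# Hypothetical helper: share of total variance kept by any orthonormal
# p x d basis A, for example the final basis of a tour.
var_retained <- function(X, A) {
  sum(diag(stats::cov(X %*% A))) / sum(diag(stats::cov(X)))
}

# Applied to the first two principal axes, this reproduces retained_2d.
all.equal(var_retained(X, pca$rotation[, 1:2]), retained_2d)
```

The same helper can be applied to the basis from a holes-optimized tour to compare it with PCA on equal terms. It does not apply to t-SNE, whose embedding is not a linear projection of the original variables.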